Term Statistics for Structured Text Retrieval

Author

  • Mounia Lalmas

Abstract

SYNONYMS
Within-element term frequency; Inverse element frequency

DEFINITION
Classical ranking algorithms in information retrieval make use of term statistics, the most common (and basic) being within-document term frequency, tf, and document frequency, df. tf is the number of occurrences of a term in a document and reflects how well the term captures the topic of that document, whereas df is the number of documents in which a term appears and reflects how well the term discriminates between relevant and non-relevant documents. df is commonly used in its inverse form, inverse document frequency, idf, since it is inversely related to the importance of a term. Both tf and idf are obtained at indexing time. Ranking algorithms for structured text retrieval, and more precisely XML retrieval, require similar term statistics, but with respect to elements.

MAIN TEXT
To calculate term statistics for elements, one could simply replace documents by elements and calculate the so-called within-element term frequency, etf, and inverse element frequency, ief. This, however, raises an issue because of the nested nature of structured documents, XML documents in particular. For instance, suppose that a section element is composed of two paragraph elements. If a term appears in one of the paragraphs, it necessarily also appears in the section. This overlap can be taken into account when calculating the ief value of a term.

In structured retrieval, in contrast to "flat" document retrieval, there are no a priori fixed retrieval units. The whole document, a part of it (e.g. one of its sections), or a part of a part (e.g. a paragraph in the section) all constitute potential answers to queries. The simplest approach to allow the retrieval of elements at any level of granularity is to index all elements. Each element thus corresponds to a document, and etf and ief for each element are calculated based on the concatenation of the text of the element and that of its descendants (e.g. [4]).
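The index-all-elements approach can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sample document, the helper names (element_terms, index_all_elements, element_frequency, ief), whitespace tokenisation, and the log-scaled ief formula are all hypothetical choices made here, not the method of [4].

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document: two sections, the first with two paragraphs.
DOC = """<article>
  <section>
    <p>xml retrieval ranks elements</p>
    <p>term statistics for xml</p>
  </section>
  <section>
    <p>flat document retrieval</p>
  </section>
</article>"""

def element_terms(elem):
    # Concatenate the element's own text with that of all its descendants,
    # then tokenise on whitespace (an assumed, deliberately naive tokeniser).
    return " ".join(elem.itertext()).lower().split()

def index_all_elements(root):
    # Every element is treated as a retrieval unit: one (tag, etf counts) pair each,
    # in document order.
    return [(elem.tag, Counter(element_terms(elem))) for elem in root.iter()]

def element_frequency(term, index):
    # Number of indexed elements that contain the term, ancestors included.
    return sum(1 for _, counts in index if counts[term] > 0)

def ief(term, index):
    # Assumed log-scaled inverse element frequency, by analogy with idf.
    n = element_frequency(term, index)
    return math.log(len(index) / n) if n else 0.0

index = index_all_elements(ET.fromstring(DOC))
etf_section = index[1][1]["xml"]         # etf of "xml" in the first section
ef_xml = element_frequency("xml", index)  # both paragraphs, their section, the article
print(etf_section, ef_xml, round(ief("xml", index), 3))
```

In this sample the term "xml" occurs in two paragraphs only, yet four of the six elements are counted as containing it, because the enclosing section and article inherit the paragraphs' text.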
With respect to the calculation of the inverse element frequency, ief, the above approach ignores the issue of nested elements. Indeed, the ief value of a term will count both the element that contains that term and all elements that contain it only by virtue of being ancestors of that element. Alternatively, ief can be estimated across elements of the same type (e.g. [3]) or across documents (e.g. [1]). The former greatly reduces the …


Similar resources

Presenting Structured Text Retrieval Results

DEFINITION Presenting structured text retrieval results refers to the fact that, in structured text retrieval, results are not independent and a judgment on their relevance needs to take their presentation into account. For example, HTML/XML/SGML documents contain a range of nested sub-trees that are fully contained in their ancestor elements. As a result, structured text retrieval should make ...


Presenting Semi-Structured Text Retrieval Results

DEFINITION Presenting semi-structured text retrieval results refers to the fact that, in semi-structured text retrieval, results are not independent and a judgment on their relevance needs to take their presentation into account. For example, HTML/XML/SGML documents contain a range of nested sub-trees that are fully contained in their ancestor elements. As a result, semi-structured text retriev...


Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach to image retrieval based on the combination of text-based and content-based features. Keywords are used as text-based features, and color and texture as content-based features. Query in this system contains some keywords and an input...


Using Structured Queries for Disambiguation in Cross-Language Information Retrieval

Bilingual transfer dictionaries are an important resource for query translation in cross-language text retrieval. However, term translation is not an isomorphic process, so dictionary-based systems must address the problem of ambiguity in language translation. In this paper, we claim that boolean conjunction (the AND operator) provides simple and automatic disambiguation in the target languag...


Integrating a Structured-Text Retrieval System with an Object-Oriented Database System

We describe the integration of a structured-text retrieval system (TextMachine) into an object-oriented database system (OpenODB). Our approach is a light-weight one, using the external function capability of the database system to encapsulate the text retrieval system as an external information source. Yet, we are able to provide a tight integration in the query language and processing; the us...




Publication date: 2009